In [1]:
%matplotlib inline
In [2]:
from pprint import pprint
import matplotlib.pyplot as plt
A Corpus is a collection of Papers with superpowers. Most importantly, it provides a consistent way of indexing bibliographic records. Indexing is important, because it sets the stage for all of the subsequent analyses that we may wish to do with our bibliographic data.
In 1. Loading Data, part 1 we used the read function in tethne.readers.wos to parse a collection of Web of Science field-tagged data files and build a Corpus.
In [11]:
from tethne.readers import wos
datapath = '/Users/erickpeirson/Downloads/datasets/wos'
corpus = wos.read(datapath)
In this notebook, we'll dive deeper into the guts of the Corpus, focusing on indexing and features.
index_by
The primary indexing field is the field that Tethne uses to identify each of the Papers in your dataset. Ideally, each one of the records in your bibliographic dataset will have this field. Good candidates include DOIs, URIs, or other unique identifiers.
Depending on which module you use, read will make assumptions about which field to use as the primary index for the Papers in your dataset. The default for Web of Science data, for example, is 'wosid' (the value of the UT field-tag).
In [5]:
print 'The primary index field for the Papers in my Corpus is "%s"' % corpus.index_by
The primary index for your Corpus can be found in the indexed_papers attribute. indexed_papers is a dictionary that maps the value of the indexing field for each Paper onto that Paper itself.
In [7]:
corpus.indexed_papers.items()[0:10] # We'll just show the first ten Papers, for the sake of space.
Out[7]:
So if you know (in this case) the wosid of a Paper, you can retrieve that Paper by passing the wosid to indexed_papers:
In [8]:
corpus.indexed_papers['WOS:000321911200011']
Out[8]:
If you'd prefer to index by a different field, you can pass the index_by parameter to read.
In [12]:
otherCorpus = wos.read(datapath, index_by='doi')
In [13]:
print 'The primary index field for the Papers in this other Corpus is "%s"' % otherCorpus.index_by
If some of the Papers lack the indexing field that you specified with the index_by parameter, Tethne will automatically generate a unique identifier for each of those Papers. For example, in our otherCorpus that we indexed by doi, most of the Papers have valid DOIs, but a few (#1, below) do not. For those, a nonsensical-looking sequence of alphanumeric characters is used instead.
In [15]:
# enumerate() supplies the running counter for each (doi, paper) pair.
for i, (doi, paper) in enumerate(otherCorpus.indexed_papers.items()[0:10]):
    print '(%i) DOI: %s \t ---> \t Paper: %s' % (i, doi.ljust(30), paper)
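To make the fallback behaviour concrete, here is a minimal plain-Python sketch of a primary index that substitutes a generated identifier when the indexing field is missing. The records, the `index_with_fallback` helper, and the use of `uuid` are all assumptions made for illustration; this is not how Tethne itself generates identifiers.

```python
import uuid

# Hypothetical records; the second one lacks a DOI.
records = [
    {'doi': '10.1000/example.1', 'title': 'Has a DOI'},
    {'doi': None, 'title': 'No DOI'},
]

def index_with_fallback(records, index_by):
    """Map each record's ``index_by`` value onto the record.

    Records without a usable value get a random hex string instead
    (an assumption about the fallback, not Tethne's actual scheme).
    """
    indexed = {}
    for record in records:
        key = record.get(index_by) or uuid.uuid4().hex
        indexed[key] = record
    return indexed

indexed = index_with_fallback(records, 'doi')

# Every key except the real DOI was generated by the fallback.
generated = [key for key in indexed if key != '10.1000/example.1']
```

Every record stays retrievable through the index, whether or not it carried the requested field.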
In [16]:
print 'The following Paper fields have been indexed: \n\n\t%s' % '\n\t'.join(corpus.indices.keys())
The 'citations' index, for example, allows us to look up all of the Papers that contain a particular bibliographic reference:
In [18]:
for citation, papers in corpus.indices['citations'].items()[7:10]: # Show just three entries, for space's sake.
    print 'The following Papers cite %s: \n\n\t%s \n' % (citation, '\n\t'.join(papers))
Notice that the values above are not Papers themselves, but identifiers. These are the same identifiers used in the primary index, so we can use them to look up Papers:
In [20]:
papers = corpus.indices['citations']['CARLSON SM 2004 EVOL ECOL RES'] # Who cited Carlson 2004?
print papers
for paper in papers:
    print corpus.indexed_papers[paper]
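The relationship between a secondary index and the primary index can be sketched in plain Python. The paper identifiers, citation strings, and the `build_index` helper below are hypothetical and only mimic the shape of the data shown above: an index maps each field value to a list of primary identifiers, which are then resolved against the primary index.

```python
from collections import defaultdict

# Hypothetical primary index: identifier -> record.
papers = {
    'P1': {'citations': ['SMITH J 2001 J EXAMPLE', 'DOE A 1999 EXAMPLE REV']},
    'P2': {'citations': ['SMITH J 2001 J EXAMPLE']},
    'P3': {'citations': []},
}

def build_index(papers, field):
    """Map each value of ``field`` onto the identifiers of the records holding it."""
    index = defaultdict(list)
    for paper_id in sorted(papers):
        values = papers[paper_id].get(field, [])
        # Wrap scalar values (e.g. a single year) so both cases share one loop.
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            index[value].append(paper_id)
    return dict(index)

citation_index = build_index(papers, 'citations')

# Two-step lookup: citation -> identifiers -> records.
citing_ids = citation_index['SMITH J 2001 J EXAMPLE']
citing_records = [papers[paper_id] for paper_id in citing_ids]
```

Storing identifiers rather than the records themselves keeps each secondary index small and leaves a single authoritative copy of every record in the primary index.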
We can create new indices using the index method. For example, to index our Corpus using the authorKeywords field:
In [22]:
corpus.index('authorKeywords')
In [25]:
for keyword, papers in corpus.indices['authorKeywords'].items()[6:10]: # Show just four entries, for space's sake.
    print 'The following Papers contain the keyword %s: \n\n\t%s \n' % (keyword, '\n\t'.join(papers))
Since we're interested in historical trends in our Corpus, we probably also want to index the date field:
In [27]:
corpus.index('date')
for date, papers in corpus.indices['date'].items()[-11:-1]: # Ten years, ending just before the most recent one.
    print 'There are %i Papers from %i' % (len(papers), date)
We can examine the distribution of Papers over time using the distribution method:
In [29]:
corpus.distribution()[-11:-1] # Ten years, ending just before the most recent one.
Out[29]:
In [30]:
plt.figure(figsize=(10, 3))
start = min(corpus.indices['date'].keys())
end = max(corpus.indices['date'].keys())
X = range(start, end + 1)
plt.plot(X, corpus.distribution(), lw=2)
plt.ylabel('Number of Papers')
plt.xlim(start, end)
plt.show()
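The plotting cell above pairs `range(start, end + 1)` with the output of `distribution`, which suggests one count per year, including years with no Papers. A minimal sketch of that idea follows, using a hypothetical `date_index` and a stand-alone `distribution` function; the zero-filling of gap years is an assumption inferred from the plot, not a statement about Tethne's implementation.

```python
# Hypothetical date index: year -> identifiers of Papers from that year.
date_index = {
    2001: ['P1', 'P2'],
    2003: ['P3'],
}

def distribution(date_index):
    """Count Papers per year over the full span, filling gap years with zero."""
    start, end = min(date_index), max(date_index)
    return [len(date_index.get(year, [])) for year in range(start, end + 1)]

counts = distribution(date_index)
years = list(range(min(date_index), max(date_index) + 1))
```

Filling the gaps keeps `years` and `counts` the same length, so they can be passed straight to `plt.plot`.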
In [31]:
corpus['WOS:000309391500014']
Out[31]:
Whoa! But it gets better. We can select Papers using any of the indices in the Corpus. For example, we can select all of the papers with the authorKeyword LIFE:
In [33]:
corpus[('authorKeywords', 'LIFE')]
Out[33]:
We can also select Papers using several values. For example, with the primary index field:
In [34]:
corpus[['WOS:000309391500014', 'WOS:000306532900015']]
Out[34]:
...and with other indexed fields (think of this as an OR search):
In [36]:
corpus[('authorKeywords', ['LIFE', 'ENZYME GENOTYPE', 'POLAR AUXIN'])]
Out[36]:
Since we indexed 'date' earlier, we could select any Papers published between 2002 and 2012:
In [38]:
papers = corpus[('date', range(2002, 2013))] # range() excludes the "last" value.
print 'There are %i Papers published between %i and %i' % (len(papers), 2002, 2012)
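The selection forms shown above (a single identifier, a list of identifiers, and a `(field, value)` or `(field, values)` tuple treated as an OR search) can be sketched in plain Python. The `select` and `as_list` helpers and the toy data are hypothetical; they only illustrate the lookup logic and make no claim about how Corpus implements it.

```python
# Hypothetical primary index and secondary indices.
primary = {'P1': 'Paper one', 'P2': 'Paper two', 'P3': 'Paper three'}
indices = {
    'date': {2010: ['P1'], 2011: ['P2'], 2012: ['P3']},
    'authorKeywords': {'LIFE': ['P1', 'P3'], 'ENZYME': ['P3']},
}

def as_list(value):
    """Wrap a scalar in a list; turn any other iterable into a list."""
    if hasattr(value, '__iter__') and not isinstance(value, str):
        return list(value)
    return [value]

def select(primary, indices, selector):
    """Return records for an identifier, a list of identifiers, or a (field, values) tuple."""
    if isinstance(selector, tuple):
        field, values = selector
        ids = []
        for value in as_list(values):
            for paper_id in indices[field].get(value, []):
                if paper_id not in ids:  # OR search: keep each match once.
                    ids.append(paper_id)
    else:
        ids = as_list(selector)
    return [primary[paper_id] for paper_id in ids]

by_id = select(primary, indices, 'P1')
by_ids = select(primary, indices, ['P1', 'P2'])
by_keywords = select(primary, indices, ('authorKeywords', ['LIFE', 'ENZYME']))
by_dates = select(primary, indices, ('date', range(2011, 2013)))
```

Because `range(2011, 2013)` stops before 2013, `by_dates` covers 2011 and 2012 only, which is the same off-by-one to keep in mind when selecting by date.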
Earlier we used specific fields in our Papers to create indices. The inverse of an index is what we call a FeatureSet. A FeatureSet contains data about the occurrence of specific features across all of the Papers in our Corpus.
The read function generates a few FeatureSets by default. All of the available FeatureSets are stored in a dictionary, the features attribute.
In [39]:
corpus.features.items()
Out[39]:
Each FeatureSet has several properties:
FeatureSet.index maps integer identifiers to specific features. For example, for author names:
In [40]:
featureset = corpus.features['authors']
for k, author in featureset.index.items()[0:10]:
    print '%i --> "%s"' % (k, ', '.join(author)) # Author names are stored as (LAST, FIRST M).
FeatureSet.lookup is the reverse of index: it maps features onto their integer IDs:
In [42]:
featureset = corpus.features['authors']
for author, k in featureset.lookup.items()[0:10]:
    print '%s --> %i' % (', '.join(author).ljust(25), k)
FeatureSet.documentCounts shows how many Papers in our Corpus have a specific feature:
In [43]:
featureset = corpus.features['authors']
for k, count in featureset.documentCounts.items()[0:10]:
    print 'Feature %i (which identifies author "%s") is found in %i documents' % (k, ', '.join(featureset.index[k]), count)
FeatureSet.features shows how many times each feature occurs in each Paper.
In [44]:
featureset.features.items()[0]
Out[44]:
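The four properties described above fit together as follows. This is a plain-Python sketch with a hypothetical `build_featureset` helper and made-up author tuples. In particular, representing each Paper's features as a list of `(feature id, count)` pairs is an assumption for illustration and may not match the structure that FeatureSet.features actually returns.

```python
from collections import Counter

# Hypothetical input: identifier -> features found in that Paper.
paper_features = {
    'P1': [('SMITH', 'J'), ('DOE', 'A')],
    'P2': [('SMITH', 'J')],
}

def build_featureset(paper_features):
    """Build index, lookup, document counts, and per-Paper feature counts."""
    index = {}                    # integer ID -> feature
    lookup = {}                   # feature -> integer ID
    document_counts = Counter()   # integer ID -> number of Papers with the feature
    features = {}                 # Paper identifier -> [(integer ID, count), ...]
    for paper_id in sorted(paper_features):
        counts = Counter(paper_features[paper_id])
        row = []
        for feature in sorted(counts):
            if feature not in lookup:
                k = len(lookup)   # Assign the next unused integer ID.
                lookup[feature] = k
                index[k] = feature
            k = lookup[feature]
            document_counts[k] += 1
            row.append((k, counts[feature]))
        features[paper_id] = row
    return index, lookup, dict(document_counts), features

index, lookup, document_counts, features = build_featureset(paper_features)
smith_id = lookup[('SMITH', 'J')]
```

Here `index` and `lookup` are exact inverses of each other, and `document_counts` counts each Paper at most once per feature, however many times the feature occurs in it.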
We can create a new FeatureSet from just about any field in our Corpus, using the index_feature method. For example, suppose that we were interested in the distribution of authorKeywords across the whole corpus:
In [46]:
corpus.index_feature('authorKeywords')
corpus.features.keys()
Out[46]:
In [48]:
featureset = corpus.features['authorKeywords']
for k, count in featureset.documentCounts.items()[0:10]:
    print 'Keyword %s is found in %i documents' % (featureset.index[k], count)
In [49]:
featureset.features['WOS:000324532900018'] # Feature for a specific Paper.
Out[49]:
In [50]:
plt.figure(figsize=(10, 3))
years, values = corpus.feature_distribution('authorKeywords', 'DIVERSITY')
start = min(years)
end = max(years)
X = range(start, end + 1)
plt.plot(years, values, lw=2)
plt.ylabel('Papers with DIVERSITY in authorKeywords')
plt.xlim(start, end)
plt.show()
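The plot above tracks one feature over time. A rough plain-Python sketch of the underlying computation combines a date index with per-Paper features. The stand-alone `feature_distribution` function and its inputs are hypothetical, and counting Papers (rather than total occurrences) is an assumption based on the y-axis label used above.

```python
# Hypothetical date index and per-Paper keyword lists.
date_index = {
    2001: ['P1', 'P2'],
    2003: ['P3'],
}
paper_keywords = {
    'P1': ['DIVERSITY', 'LIFE'],
    'P2': ['LIFE'],
    'P3': ['DIVERSITY'],
}

def feature_distribution(date_index, paper_features, feature):
    """Return (years, values): the number of Papers per year that carry ``feature``."""
    years = list(range(min(date_index), max(date_index) + 1))
    values = []
    for year in years:
        paper_ids = date_index.get(year, [])
        values.append(sum(1 for paper_id in paper_ids
                          if feature in paper_features.get(paper_id, [])))
    return years, values

years, values = feature_distribution(date_index, paper_keywords, 'DIVERSITY')
```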